Variable selection through CART

Authors

  • Marie Sauvé
  • Christine Tuleau-Malot
Abstract

This paper deals with variable selection in regression and binary classification frameworks. It proposes an automatic and exhaustive procedure that relies on the CART algorithm and on model selection via penalization. This work, theoretical in nature, aims to determine adequate penalties, i.e. penalties that yield oracle-type inequalities justifying the performance of the proposed procedure. Since the exhaustive procedure cannot be carried out when the number of variables is too large, a more practical procedure is also proposed, and it remains theoretically validated. A simulation study complements the theoretical results.

Mathematics Subject Classification. 62G05, 62G07, 62G20.

Received December 4, 2012. Revised December 26, 2013.
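To make the idea of an exhaustive, penalized search concrete, the following is a minimal sketch in plain Python, not the authors' procedure. It rests on several assumptions that do not come from the abstract: the helper names (`sse`, `fit_tree`, `select_variables`), the use of depth-limited greedy trees in place of CART's pruned subtrees, the shape of the penalty (one term growing with the number of leaves, one with the number of variables), and the constants `alpha` and `beta`. The paper's contribution is precisely the derivation of adequate penalties, which this sketch does not reproduce.

```python
import itertools
import math


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, features, max_depth, min_leaf=2):
    """Greedy CART-style regression tree restricted to `features`.

    Returns (residual sum of squares, number of leaves).
    """
    parent = sse(y)
    best = None
    if max_depth > 0:
        for j in features:
            for t in sorted({row[j] for row in X}):
                left = [i for i, row in enumerate(X) if row[j] <= t]
                right = [i for i, row in enumerate(X) if row[j] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
                # Keep only splits that strictly reduce the squared error.
                if cost < parent - 1e-12 and (best is None or cost < best[0]):
                    best = (cost, left, right)
    if best is None:
        return parent, 1
    _, left, right = best
    rss_l, leaves_l = fit_tree([X[i] for i in left], [y[i] for i in left],
                               features, max_depth - 1, min_leaf)
    rss_r, leaves_r = fit_tree([X[i] for i in right], [y[i] for i in right],
                               features, max_depth - 1, min_leaf)
    return rss_l + rss_r, leaves_l + leaves_r


def select_variables(X, y, max_depth=3, alpha=1.0, beta=1.0):
    """Exhaustive search over non-empty variable subsets and tree depths.

    Minimizes empirical risk plus a penalty (assumed form, see lead-in).
    Returns (criterion, selected subset, number of leaves).
    """
    n, p = len(X), len(X[0])
    best = None
    for k in range(1, p + 1):
        for subset in itertools.combinations(range(p), k):
            for depth in range(max_depth + 1):
                rss, leaves = fit_tree(X, y, subset, depth)
                penalty = alpha * leaves / n + beta * k / n * (1 + math.log(p / k))
                crit = rss / n + penalty
                if best is None or crit < best[0]:
                    best = (crit, subset, leaves)
    return best


# Toy data: the response depends on variable 0 only; variables 1 and 2 are noise.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 2
y = [5.0 * row[0] for row in X]
criterion, subset, leaves = select_variables(X, y)
print(subset, leaves)
```

The loop over all subsets visits 2^p - 1 candidates, which illustrates why an exhaustive search of this kind stops being feasible as the number of variables grows.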

Similar resources

Selection of models for the analysis of risk-factor trees: leveraging biological knowledge to mine large sets of risk factors with application to microbiome data

MOTIVATION: Establishing a statistical association between microbiome features and clinical outcomes is of growing interest because of the potential for yielding insights into biological mechanisms and pathogenesis. Extracting microbiome features that are relevant for a disease is challenging, and existing variable selection methods are limited due to the large number of risk factor variables fro...

Basis-Function Trees as a Generalization of Local Variable Selection Methods

Local variable selection has proven to be a powerful technique for approximating functions in high-dimensional spaces. It is used in several statistical methods, including CART, ID3, C4, MARS, and others (see the bibliography for references to these algorithms). In this paper I present a tree-structured network which is a generalization of these techniques. The network provides a framework for ...

A bias correction algorithm for the Gini variable importance measure in classification trees

This paper considers a measure of variable importance frequently used in variable selection methods based on decision trees and tree-based ensemble models, like CART, Random Forests and Gradient Boosting Machine. It is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Some authors showed that thi...

Basis-Function Trees as a Generalization of Local Variable Selection Methods for Function Approximation

Local variable selection has proven to be a powerful technique for approximating functions in high-dimensional spaces. It is used in several statistical methods, including CART, ID3, C4, MARS, and others (see the bibliography for references to these algorithms). In this paper I present a tree-structured network which is a generalization of these techniques. The network provides a framework for ...

CART-based selection of bankruptcy predictors for the logit model

Balance-sheet data offer a potentially large number of candidate predictors of corporate financial failure. In this paper we provide a novel predictor selection procedure based on the non-parametric regression and classification tree (CART) method and test its performance within a standard logit model. We show that a simple logit model with dummy variables created in accordance with the nodes of es...

Publication date: 2006